﻿\chapter{RARECODONS}

\section{Abstract}

Biologists commonly consider synonymous mutations--those mutations which alter
a DNA sequence, but not the amino acid sequence in the corresponding protein--
to be functionally neutral. The rareness of particular codon synonyms varies
between organisms\cite{thing1}. Previous work has shown that common codons are
translated more quickly than their rarer synonyms, implying that protein
production can be made more efficient by increasing the usage of common
codons\cite{thing2}.  Due to this selective pressure, rare codons were assumed
to be randomly distributed consequences of genetic drift\cite{thing3}. However,
by quantifying the relative rarity of codons, our collaborators were able to
show that rare codons cluster more than is expected by random chance across a
diverse section of the tree of life\cite{thing4}.  Our biological collaborators
hypothesize that rare codon clusters have functional effects. If this were the
case, we would expect to find evidence of rare codon cluster conservation in
orthologous genes across organisms. In this project, we develop a pipeline to
quantify codon rarity and the conservation thereof in alignments of orthologous
genes.

\section{Introduction}
%How do we test this hypothesis?
%Find orthologs from diverse genomes
%align them with one another
%calculate %min-max for each sequence in alignemtn
%calculate a significance threshold and splice locations of significant rare codon clusters into
%multiple alignment
%test to evaluate the likelihood of significantly rare codon clusters aligning by chance

%Let's talk about biological background
%1. Nucleotide sequence is degenerate
%2. Different Codons appear at different frequencies
%3. Codon rareness is a proxy for translation frequency (Translation figure)

Each amino acid is coded by a nucleotide triplet known as a codon.  However, there are only
20 amino acids and 64 possible codons, 3 of which are commonly reserved as "stop" codons indicating
the end of a coding sequence.  This degeneracy allows for synonymous coding for certain amino acids.
The most common coding table is shown in Figure \ref{fig:codontable}.

\begin{figure}
\includegraphics[width=5.5in]{figures/codon_table.pdf}
\caption{Standard codon table}
\label{fig:codontable}
\end{figure}

Previously, researchers have shown that rare codons are translated more slowly
than their more common synonyms \cite{research}. Though the correlation is
imperfect, this allows us to use codon-rarity as a proxy for translation rate.
Translation is achieved in the ribosome by binding tRNA to mRNA to produce
amino acid chains as conceptualized in Figure \ref{fig:translationdiagram}.
Rare codons tend to have fewer matching tRNAs, slowing the incorporation of
amino acids into the protein, and increasing the rate of error.

\begin{figure}
\includegraphics[width=5.5in]{figures/Translation_diagram.pdf}
\caption{The ribosome binds tRNA to mRNA to produce amino acid chains--a process known as translation.  Rare codons have fewer tRNAs, slowing the incorporation of amino acids into the protein.}
\label{fig:translationdiagram}
\end{figure}

Our collaborators came up with an algorithm to characterize the rareness of
given stretches of sequence controlling for amino acid content \cite{minmax}.
Figure \ref{fig:minmaxalgorithm} diagrams their algorithm for performing this
calculation.

\begin{figure}
\includegraphics[width=5.5in]{figures/min-max-algorithm.pdf}
\caption{The \%min-max algorithm for determining the amino acid constrained
codon rarity of sequence windows}
\label{fig:minmaxalgorithm}
\end{figure}

Using this method, they established strikingly non-random clustering of codon
rarity across many different organisms, as shown in Figure
\label{fig:rarecodonscluster}. 

If these clusterings have functional effects--e.g. through cotranslational
folding-- then we would expect them to be conserved across orthologs.  We find
aligned regions of rareness at far greater frequency than would be expected by
random chance.  These results hold after controls for amino acid sequence,
GC-content, and codon-pair bias. %I hope
The software produced provides a framework to evaluate the significance of 
alignments of regions of interest in orthologs, as well as many useful tools
to produce appropriately constrained random control sequences on distributed
computing resources.

\begin{figure}
\includegraphics[width=5.5in]{figures/rare-codons-cluster.pdf}
\caption{Rare codons cluster}
\label{fig:rarecodonscluster}
\end{figure}

\section{Methods}
%this bit applies to both
\subsection{pre-processing}
We identify orthologs using the Orthomcl tool and align those orthologs with
the MUSCLE multiple alignment tool\cite{thing5,thing6}.  We then use our
implementation of the \%min-max algorithm to identify rare codon clusters in
each aligned sequence.  

%original methods
We establish a significant level of \%min by comparing gene results to 200
random reverse translations of the gene's amino acids using the organisms'
global codon usage. 

\subsection{Peak Method}
Our initial implementation focused on identifying aligned windows of rareness
at a per-position basis.  In this method, codons with \%min below the
threshhold that are also minimal within a 17 codon window were identified as
``peaks". By splicing the distance from the nearest peak into each sequence's
position in the multiple alignment, we enable exact calculation of the
probability of the level of peak correspondence at each position. Figure
\ref{fig:peakfile} shows an example file in this format.  The processing
workflow for this method is shown in Figure \ref{fig:rccarchitecture}.

%original workflow

\begin{figure}
\includegraphics[width=5.5in]{figures/RC-Clust_Architecture.pdf}
\caption{Original pipeline to evaluate significance of rare codon cluster alignment}
\label{fig:rccarchitecture}
\end{figure}

\subsection{Poisson Sampling}
%modify to talk about peaks, rather than positions

Our initial probability calculation makes a few simplifying assumptions--most
importantly that each rare codon peak in a given sequence is an independent
random event, and is representative of a rare codon cluster. Using this
assumption enables us to utilize Poisson Sampling as shown in figure
\ref{fig:poissonsampling}.  This calculation is exponential in the number of
aligned sequences, but can be performed in parallel for each position in the
alignment (rows in the provided figure).  %move (rows in the provided figure)
to caption perhaps?

\begin{figure}
\includegraphics[width=5.5in]{figures/Bernoulli_Sampling.pdf}
\caption{Poisson Sampling}
\label{fig:poissonsampling}
\end{figure}

\subsection{Limitation of Poisson}

Though the Poisson Sampling strategy produced multiple positions with low
p-values, it suffered from several important conceptual limitations. The method
splits clusters of codons into multiple ``independent" parts, though in
practice this does not reflect the biological linkage of the split clusters.
This splitting increases the total number of interesting events in each
alignment, diluting the p-values through false proliferation.  Furthermore, it
is difficult to determine the appropriate null model to correct
multiple-sampling problems with the calculated p-values. Standard estimates of
the expected number of significant events predicted far more correspondences
than were present in the dataset.  

Finally, as the distributed computing strategy developed only exploits
parallelism at the granularity of alignment columns, large ortholog groups can
still exceed practical computational limits.  Approximations of the method
could be used to work around these limits at the price of precision.  

\subsection{Holistic Evaluation of Rare Codon Co-Occurrence}

The challenges with the "peak" strategy led us to consider metrics and methods
for simply evaluating the likelihood that a given orthology group would show
the observed amount of rare-codon co-occurrence by chance alone.  To do so, we
established or borrowed a variety of measures for comparing distributions and
applied them to samples and randomized populations.  We calculate significance
by comparing to two random controls: one holding rare codon cluster presence
constant, and the other controlling for protein sequence, GC-enrichment
, and codon pair bias.

\subsection{Comparison Measures} 

Intuitively, we are interested in orthology groups exhibiting more overlap of
predominantly rare regions than can be explained by random chance. We characterize
the degree of overlap through a histogram of position counts with overlaps of varying
depth. We compute several measures on these histograms to summarize the extent
of overlap in a given group as compared to random shuffles of the rare codons
in the same alignment.

All of the measures operate by comparison of a sample histogram S to a random
histogram R, where both S and R bin positions by overlap depth (overlap depth
on x axis, \# positions with that depth on y axis).  In addition to the base bin
histograms, the algorithm can consider histograms linearly weighted by depth,
and cumulative histograms--where each bin contains all values greater than or
equal to the given depth.  Figure \ref{fig:ovlhistexample} shows examples of 
such histograms.

\begin{figure}
\includegraphics[width=5.5in]{figures/ovlhistexample.png}
\caption{Overlap Histogram Construction.  All histograms constructed from shown alignment.}
\label{fig:ovlhistexample}
\end{figure}

To structure and automate our search for overlaps of interest, we use the following metrics:
\subsubsection{n99}
n99(histogram h1) - Lowest value of x such that the sum of all y values for each lower x in h1 is at least 99 percent of the sum of all y values
if n99(R) >= n99(S) then we consider R to be a similarly significant example, and add one to our fail count.  The p-value is fail-count/trial-count.  
%[example image]

\subsubsection{skew}
skew(histogram h1) - the mathematical skew of h1
if skew(R) >= skew(S) then we consider R to be a similarly significant example, and add one to our fail count.  The p-value is fail-count/trial-count.  
%[example image]

\subsubsection{solomon}
solomon(histogram h1, depth n) -  the sum of the positions of depth greater than n in h1
if solomon(R) >= solomon(S) then we consider R to be a similarly significant example and add one to our fail count.  The p-value is fail-count/trial-count.
%[example image]

\subsubsection{kurtosis}
We include kurtosis not as a direct estimate of the right-tail weight of the distributions, but rather to inform our interpretation of results showing significantly different skew.
kurtosis(histogram h1) - the excess kurtosis of the histogram h1
if kurtosis(R) >= kurtosis(S) then we conside R to be a similarly significant example and add one to our fail count.  The p-value is fail-count/trial-count.  
%[example image]

\subsubsection{Correlation of Comparison Measures}
1. we expected them to be correlated
2. they were
3. we expected them to differ
4. they did
5. look, they are similar enough to be gesturing at the same thing and different enough to be useful in aggregate!
6. figure of the correlation matrix

\subsection{Controlling for GC-content}

Though the random shuffles establish significance regarding the degree of overlap,
there are competing explanations for this significance. We wish to distinguish
between selective pressure favoring rare codons from rareness dictated by
GC content and the constraints of the amino acid sequence. To do this, we produce
a set of negative controls--random reverse translations maintaining similar GC content
to the sample. Sequences for which these controls produce similarly aligned rare
codon clusters are excluded in our search for candidates for co-translational folding.

%Obviously, we need to describe our methods for GC-content and codon-pair bias control

\subsection{Visualization} 

To present the information provided by these measures, we constructed a
multipart visualization as seen in \ref{fig:visualizationexample}. The top
badge shows the inverse logs of the measured p-values for non-cumulative
histograms, the bottom badge shows cumulatively calculated measures.  The blue
bars in the histogram represent the sample values, the candlesticks show the
minimum, first, second, and third quartiles, and maximum of the random
permutation histograms. The bottom chart describes the alignment of orthologous
genes, with areas of rare codons denoted by thickened lines.

\begin{figure}
\includegraphics[width=5.5in]{figures/example_vis.png}
\caption{Example Visualization}
\label{fig:visualizationexample}
\end{figure}

\section{Results}

Here we discuss the computational performance of our tools, and the preliminary
biological results generated by their execution.  Our tools sustain substantial
parallelism and run in diverse environments.

\subsection{Performance}


\subsection{Biological Results}
